[Core] remove cupy dependency #3625
Supporting evidence from #3617: this PR solved the cupy initialization issue 👍
Can we get confirmation from the AMD people that we can do the same thing to replace PyTorch RCCL? I believe we don't have to do this in the current PR, but I'd like to get confirmation and find someone who can take that on.
I made an initial pass and it looks good overall. I think we may be lacking some tests in test_pynccl.py?
@WoosukKwon @simon-mo thanks for the review! I asked the AMD folks for a review. Let's see if we can get feedback today. If not, we can merge first and let them send fix PRs later.
Thanks for the great work! Left some questions.
vllm/worker/model_runner.py
Outdated
# Delete the CUDA graphs before deleting the CuPy NCCL communicator.
# Delete the CUDA graphs before deleting the pynccl communicator.
# NOTE(woosuk): This is necessary because otherwise deadlocks can
# happen.
# FIXME(woosuk): This is a bit hacky. Find a more robust solution.
self.graph_runners.clear()
self.cupy_nccl_backend = None
I think this part can be deleted as we remove CuPy?
I don't know why this code was needed before. So I don't want to change it 🤣 If you have more context, we can discuss if we can delete this.
When using CuPy + CUDA graphs, deadlocks happen if the CuPy backend is deleted before the CUDA graphs that use it are deleted. I actually don't know the reason for this, but it doesn't happen when using NCCL through PyTorch, probably because the NCCL communicator managed by PyTorch is deleted at the very end.
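To make the ordering concern concrete, here is a minimal sketch of the teardown order described above: release the captured graphs first, then drop the communicator. Every class name here (`FakeCommunicator`, `FakeGraphRunner`, `FakeModelRunner`) and the `shutdown` method are hypothetical stand-ins, not vLLM, CuPy, or pynccl APIs. The sketch needs no GPU and only records the order in which resources are released.

```python
# Illustrative only: hypothetical stand-ins, not the real vLLM objects.

class FakeCommunicator:
    """Stand-in for an NCCL communicator wrapper (assumed name)."""

    def __init__(self, log):
        self.log = log
        self.closed = False

    def close(self):
        self.closed = True
        self.log.append("communicator closed")


class FakeGraphRunner:
    """Stand-in for a captured CUDA graph that uses the communicator."""

    def __init__(self, comm, log):
        self.comm = comm
        self.log = log

    def release(self):
        # The graph may still reference the communicator, so the
        # communicator is expected to be alive at this point.
        assert not self.comm.closed, "communicator closed before its graphs"
        self.log.append("graph released")


class FakeModelRunner:
    def __init__(self, num_graphs=2):
        self.log = []
        self.comm = FakeCommunicator(self.log)
        self.graph_runners = {
            i: FakeGraphRunner(self.comm, self.log) for i in range(num_graphs)
        }

    def shutdown(self):
        # Step 1: release every graph that depends on the communicator.
        for graph_runner in self.graph_runners.values():
            graph_runner.release()
        self.graph_runners.clear()
        # Step 2: only now drop the communicator itself.
        self.comm.close()
        self.comm = None


runner = FakeModelRunner()
runner.shutdown()
print(runner.log)
```

Making the order explicit in one method, rather than relying on when objects happen to be garbage collected, is one possible way to address the "hacky" concern in the FIXME. Whether it would avoid the real deadlock is not something this sketch can show.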
@youkaichao could you check whether we can delete this? You can simply run python examples/llm_engine_example.py -tp 2 and see if the process hangs.
Okay, will have a try.
BTW, I do agree with you that we can remove this code some time later. If you'd like to do so, could you please add a TODO in the code?
I tested the code with the above two lines removed, running it 100 times:
for i in {1..100}; do python3 examples/llm_engine_example.py -tp=2 && echo "Test passed $i times" || break; done
I didn't see any deadlocks. This gives us confidence in removing the code later. I left a comment there to remove it in v0.4.1.
Overall, I think we should take a small step with every release. Distributed bugs such as hangs and deadlocks are highly unstable and difficult to test. My plan is to use cc @WoosukKwon
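The shell loop above stops on a non-zero exit code, but a run that really deadlocks would block it forever. A hedged sketch of the same repeat-run check with a per-run timeout follows. The helper name `run_repeatedly` and its parameters are made up for illustration, and the demo command is a cheap placeholder rather than the real example script.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs=100, timeout_s=600):
    """Run `cmd` up to `runs` times and return how many runs passed.

    Stops at the first non-zero exit code or at the first run that
    exceeds `timeout_s` seconds (treated here as a suspected hang).
    """
    passed = 0
    for i in range(1, runs + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if the command does not finish within timeout_s seconds.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            print(f"Run {i} exceeded {timeout_s}s; treating it as a hang")
            break
        if result.returncode != 0:
            print(f"Run {i} failed with exit code {result.returncode}")
            break
        passed += 1
        print(f"Test passed {i} times")
    return passed


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere. For the check in
    # this thread, the command would instead be:
    #   ["python3", "examples/llm_engine_example.py", "-tp=2"]
    demo_cmd = [sys.executable, "-c", "pass"]
    run_repeatedly(demo_cmd, runs=3, timeout_s=30)
```

The timeout value is an assumption and would need to be set comfortably above the normal runtime of the example on the test machine.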
Thanks again for the great work!
Hello everyone,
Separated out from #3442; this PR only removes the cupy dependency.